ZhurnalyWiki Surge Control

A few days ago I received a warning notice from my ISP, reporting that zhurnaly.com traffic has shot up in recent months. In November 2009 alone the volume of pages served has already gone above 10 GB. That's unhappy news, since my current service plan gives me only 10 GB before surcharges begin. More than half the bandwidth has been consumed by hosts identifying themselves as "crawl-66-249-65-187.googlebot.com", "crawl-66-249-71-245.googlebot.com", "crawl-66-249-65-186.googlebot.com", and "crawl-66-249-65-129.googlebot.com". These could be normal Google web-crawler robots, I suppose, or perhaps impostors attempting to insert spam into the wiki.
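
For what it's worth, the hostname a crawler announces is trivial to forge; the standard test for a genuine Googlebot is a reverse DNS lookup on the requesting IP, a check that the name falls under googlebot.com or google.com, and then a forward lookup that must round-trip to the same address. A minimal Python sketch of that check (the sample IP is taken from the logs below):

import socket

def is_real_googlebot(ip):
    """Reverse-then-forward DNS check for a claimed Googlebot."""
    try:
        host = socket.gethostbyaddr(ip)[0]            # reverse lookup
    except socket.herror:
        return False
    if not host.endswith(('.googlebot.com', '.google.com')):
        return False                                  # wrong domain: impostor
    try:
        addresses = socket.gethostbyname_ex(host)[2]  # forward lookup
    except socket.gaierror:
        return False
    return ip in addresses                            # must round-trip

print(is_real_googlebot('66.249.65.39'))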

It's unclear what I should do. For the moment, I've set the "surge protection" on the wiki engine to a stricter level: 4 pages per 20 seconds instead of the default 10 per 20. This shouldn't affect most human users, I think, but if you find it annoying please contact me (email z "at-sign" his "dot" com) and I'll try to tune it better.
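
For the curious, surge protection is nothing exotic: the engine keeps a short history of recent requests per visitor and refuses service once the count inside a sliding window passes the limit. If the engine is Oddmuse, as the 10-per-20 default suggests, the knobs are $SurgeProtectionViews and $SurgeProtectionTime; here's a rough Python sketch of the mechanism, not the engine's actual code:

import time
from collections import defaultdict, deque

WINDOW = 20  # seconds per window
LIMIT = 4    # page views allowed per visitor per window

recent = defaultdict(deque)  # visitor -> timestamps of recent requests

def allow(visitor, now=None):
    """True if this request stays within the surge limit."""
    now = time.time() if now is None else now
    q = recent[visitor]
    while q and now - q[0] > WINDOW:  # forget requests older than the window
        q.popleft()
    if len(q) >= LIMIT:
        return False                  # over the limit: answer with a 503
    q.append(now)
    return True

(The 503 status on the last log line below looks like exactly this limiter answering.)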

Any other suggestions? The log files show entries like this:

66.249.65.39 - - [23/Nov/2009:22:21:31 -0500] "GET /cgi-bin/wiki/HAT%20Run%202008/HomePage/Zhurnal_and_Zhurnaly/TopicLanguage/ConfoundedConflation/TopicLanguage/DangerousLiterature/ChekhovOnTolstoy/GlobeOfLife/TruthInBattle HTTP/1.1" 200 11759 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.39 - - [23/Nov/2009:22:21:32 -0500] "GET /cgi-bin/wiki/HAT%20Run%202008/SigilOfPower/TopicLanguage/JournalBearing/ReadLikely/TopicHumor/ConfoundedConflation/LaterDude/TopicPersonalHistory/LongDistanceFriendliness/Comments_on_LongDistanceFriendliness HTTP/1.1" 404 8175 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.39 - - [23/Nov/2009:22:21:32 -0500] "GET /cgi-bin/wiki/2004-08-07_-_Robert_Frost_Trail_(northeast)/HomePage/Bo_Leuf,_R.I.P./In_Memoriam/HighTension/TopicScience/MardiGras/Comments_on_MardiGras HTTP/1.1" 404 7234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.39 - - [23/Nov/2009:22:21:33 -0500] "GET /cgi-bin/wiki/HAT%20Run%202008/HatRun2004/TopicRunning/HandicapJogging/TopicScience/HansBethe/TopicPersonalHistory/IntestinalInfortitude/2006-08-12_-_Iwo_Jima_Jog/HomePage/Comments_on_HomePage HTTP/1.1" 503 1894 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

... that is, a GET request every second or so for ill-formed URLs made of many ZhurnalyWiki page names strung together with slashes. I don't fully understand these log entries; perhaps they come from some out-of-control automated system that isn't really Google at all? Or is Google just merrily crawling my pages?
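
One way to pin down who is actually eating the bandwidth is to total the byte counts per user-agent and per host straight from the access log. A rough Python sketch, assuming Apache's combined log format and a log file named access_log (both assumptions):

import re
from collections import Counter

# host ident user [date] "request" status bytes "referer" "user-agent"
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) (\d+|-) "[^"]*" "([^"]*)"')

bytes_by_agent = Counter()
hits_by_host = Counter()

with open('access_log') as log:
    for line in log:
        m = LINE.match(line)
        if not m:
            continue
        host, status, size, agent = m.groups()
        bytes_by_agent[agent] += 0 if size == '-' else int(size)
        hits_by_host[host] += 1

for agent, total in bytes_by_agent.most_common(5):
    print(total, agent)
for host, hits in hits_by_host.most_common(5):
    print(hits, host)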

This is another thing that I wish I didn't have to think about! Do I have to turn off the public ZhurnalyWiki entirely and offer only non-interactive pages? Or should I be cheerful that my pages are getting indexed by Google?

(cf. WebLogAnalysis (2001-06-02), VisitorStats (2003-10-17), ...) - ^z - 2009-11-24